Review

Stepping stones in DNA sequencing 



Henrik Stranneheim 1,2 and Joakim Lundeberg 1

1 Science for Life Laboratory, KTH Royal Institute of Technology, Stockholm, Sweden 

2 Department of Molecular Medicine and Surgery, Karolinska Institutet, Stockholm, Sweden 



In recent years there have been tremendous advances in our ability to rapidly and cost-effectively 
sequence DNA. This has revolutionized the fields of genetics and biology, leading to a deeper un- 
derstanding of the molecular events in life processes. The rapid technological advances have enor- 
mously expanded sequencing opportunities and applications, but also imposed strains and chal- 
lenges on steps prior to sequencing and in the downstream process of handling and analysis of 
these massive amounts of sequence data. Traditionally, sequencing has been limited to small DNA
fragments of approximately one thousand bases (derived from the organism's genome) due to is- 
sues in maintaining a high sequence quality and accuracy for longer read lengths. Although many 
technological breakthroughs have been made, currently the commercially available massively par- 
allel sequencing methods have not been able to resolve this issue. However, recent announce- 
ments in nanopore sequencing hold the promise of removing this read-length limitation, enabling 
sequencing of larger intact DNA fragments. The ability to sequence longer intact DNA with high 
accuracy is a major stepping stone towards greatly simplifying the downstream analysis and in- 
creasing the power of sequencing compared to today. This review covers some of the technical 
advances in sequencing that have opened up new frontiers in genomics. 
1 Introduction

Fully understanding the language of DNA requires the 
complete determination of the order of bases in the 
genome of humans (or other organisms of interest). Ac- 
quiring that knowledge promises to yield more complete 
insights into biological variations and etiology of dis- 
eases. In the dawn of sequencing, reading the order of the 
four bases of DNA was a cumbersome process. Sequenc- 
ing was made possible, although still difficult, by the in- 
troduction of the chemical degradation method of Maxam 
and Gilbert [1] and Sanger's chain-termination sequenc- 
ing method [2] at the end of the 1970s. The latter proved 
to be more useful and was to be the dominant DNA
sequencing technique for almost three decades, propelled 
by the Human Genome Project (HGP), and is still consid- 
ered by many the "gold standard". The commercial launch 
of massively parallel DNA sequencing instruments in 
2005 initiated a paradigm shift powered by new DNA 
sequencing techniques that inspired researchers to 
address bolder questions in genome-wide experiments. 
Recent years have seen many new contenders in the field 
of massively parallel sequencing. Notably, innovative nov- 
el sequencing methods using single molecules of DNA 
and real-time detection have emerged. Currently, these 
novel approaches are complementing existing sequenc- 
ing platforms, but they have a long way to go before 
replacing massively parallel sequencing. 

The most commonly used DNA sequencing technolo- 
gies, and their challenges and limitations, are described 
below. A summary table of the features of each technolo- 
gy is provided in Table 1 while the number of sequence 
reads and read lengths are shown in Fig. 1. 
Figure 1. (A) Changes of read length and degree of parallelism in sequencing technologies since the 1990s up to the present. (B) Number of reads and
read length per sequencing technology. 
2 Sanger sequencing - chain-termination
sequencing 

The ingenuity of the chain-termination technique lies in 
the use of chain-terminating nucleotides, dideoxy- 
nucleotides that lack a 3'-hydroxyl group, restricting fur- 
ther extension of the copied DNA chain. Early Sanger se- 
quencing required the sequencing process to be split into 
four separate reactions. Each reaction involves a single- 
stranded DNA template, a DNA primer and a DNA poly- 
merase in the presence of a mixture of the four unmodified 
nucleotides, one of which is labeled, and a type of modi- 
fied chain-terminating nucleotide. Fragments of varying 
length are synthesized after primer hybridization and poly- 
merase extension, all having the same 5' end but termi- 
nated by a chain-terminating nucleotide at the 3' end. 
Adding only a fraction of the terminating nucleotide en- 
sures the random incorporation of dideoxynucleotides in 
only a small subset of molecules. Since different chain- 
terminating nucleotides are used in the four sequencing 
reactions, all combinations of termination can be pro- 
duced. The generated 3'-terminated DNA templates are 
then heat-denatured and fractionated by gel-electropho- 
resis, running products of all four sequencing reactions in 
parallel [2]. Using radioactive or, more recently, fluorescent 
labeling to visualize the bands enables the sequence of the 
original DNA template to be determined by following the 
migration order of successively larger fragments in the gel. 
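
The read-out logic of the four-lane experiment can be illustrated with a small simulation. The following is a minimal sketch and not a description of any laboratory protocol: the function names, the 5% termination probability and the number of simulated molecules are assumptions chosen for illustration, and the input string stands for the newly synthesized strand written in the 5' to 3' direction.

```python
import random


def sanger_fragments(new_strand, ddntp_fraction=0.05, molecules=2000, seed=0):
    """Simulate the four chain-termination reactions.

    Returns a dict mapping each terminating base to the set of fragment
    lengths observed in that reaction (one "lane" per base).
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    lanes = {base: set() for base in "ACGT"}
    for dd_base in "ACGT":
        for _ in range(molecules):
            # Extend one molecule; at each matching position a small
            # fraction of molecules incorporates the dideoxynucleotide.
            for length, base in enumerate(new_strand, start=1):
                if base == dd_base and rng.random() < ddntp_fraction:
                    lanes[dd_base].add(length)
                    break  # chain terminated, no further extension
    return lanes


def read_gel(lanes):
    """Read the sequence by following successively larger fragments."""
    lane_of_length = {
        length: base for base, lengths in lanes.items() for length in lengths
    }
    return "".join(lane_of_length[n] for n in sorted(lane_of_length))


lanes = sanger_fragments("GATTACAGGC")
print(sorted(lanes["T"]))  # fragment lengths ending in a terminating T
print(read_gel(lanes))
```

Because every fragment length occurs in exactly one lane, sorting the fragments by size and noting the lane of each band recovers the sequence, which is the principle described above.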



Several enhancements have been made to the original 
method developed by Sanger. These include: labeling the
chain-terminating nucleotides with spectrally distinct fluorescent
dyes, enabling a single tube and lane to be used
in the fragment generation and fractionation steps, respectively
[3, 4]; elimination of the need to cast gels by using
capillary gel electrophoresis [5, 6]; and automation of
the protocol, leading to increases in parallelization, reproducibility
and throughput [7]. Sanger sequencing is still
widely used today for many applications, particularly val- 
idation of genetic variants and in cases where high qual- 
ity reads of 300-900 bases are needed. However, the ma- 
jor advances in sequencing technology in recent years 
have not been related to the mature Sanger sequencing 
method, but to the rapidly evolving massively parallel se- 
quencing methods. 

3 Massively parallel sequencing - 
consensus sequencing 

The most widely used sequencing platforms in genetic re- 
search today are the massively parallel sequencing plat- 
forms, which have inherited many features from Sanger 
sequencing, such as the use of polymerases for synthesis, 
modified nucleotides and fluorescent detection. Another 
feature is that they require the DNA to be clonally amplified,
forming a consensus template prior to sequencing.
These sequencing methods have been called next generation
sequencing, high-throughput sequencing methods,
or second-generation sequencing. However, as the se- 
quencing technologies continue to develop, it is hard to 
keep referring to old and novel emerging sequencing 
technologies as belonging to a certain generation. There- 
fore, it is preferable to categorize them by their most 
prominent common feature, such as being massively par- 
allel or single-molecule sequencing methods. Currently, 
there are five competing massively parallel sequencing 
technologies, each with specific strengths and weak- 
nesses. These are described and discussed below. 

3.1 Pyrosequencing 

Melamede originally outlined the concept of sequencing- 
by-synthesis (SBS) in 1985, in a report of efforts to detect 
nucleotide incorporation events by measuring nucleotide 
absorbance [8]. Unaware of Melamede's findings, Nyrén
conceived another SBS approach using bioluminescence 
instead of absorbance in 1986 [9], which led (after anoth- 
er 12 years of experiments) to the introduction of pyro- 
sequencing [10]. Pyrosequencing technology was able to 
complement Sanger sequencing and was further devel- 
oped by 454 Life Sciences, founded by Jonathan Roth- 
berg. In 2005, Rothberg and colleagues [11] released the 
first proof-of-concept paper demonstrating a massively 
parallel pyrosequencing approach, yielding a tremendous 
increase in sequence capacity, and thereby transforming 
the way sequencing was conducted. 

In SBS, the event that is detected is the incorporation 
of nucleotides into growing DNA strands. As the nu- 
cleotides are incorporated by the polymerase, pyrophos- 
phate (PPi) and protons are generated. In pyrosequenc- 
ing, the PPi is used in an enzymatic cascade to generate 
a light burst. The PPi is converted by ATP sulfurylase to 
ATP, which is then used by the firefly enzyme luciferase 
to generate photons, providing a light signal that is pro- 
portional to the number of nucleotides being incorporat- 
ed. The challenge is to detect the flashes of light from 
each unique DNA template while sequencing numerous 
templates in parallel. Spatially separating each sequenc- 
ing reaction on beads deposited in small wells on a pico- 
titer plate elegantly solved this problem. 

Massively parallel pyrosequencing begins with frag- 
mentation of the DNA and adaptor ligation. Single-strand- 
ed DNA templates are then bound on beads and emulsion 
PCR is performed, clonally amplifying each DNA template 
in aqueous microreactors isolated by oil. The emulsion is 
then broken and the beads carrying DNA are separated 
from empty beads in a process called enrichment. The en- 
riched beads are deposited in the small wells on the pi- 
cotiter plate together with primer and DNA polymerase. 
Ideally, only one of these beads will fit into each well. 
Smaller beads are also added, carrying the enzymes responsible
for the light generation using PPi. Nucleotides
are then passed over the substrate in a laminar flow of
solutions applied in a predetermined order, and the flash- 
es of light in each well representing incorporation events 
are recorded. Efficient removal of reaction by-products, 
which could otherwise perturb the sequencing reaction, 
is facilitated by the laminar flow. Hence, the sequence of 
the DNA templates is determined from the knowledge of 
their location, the order of the flow of nucleotides and 
records of each flash of light from each well [10-12]. The
major drawback of this approach is the difficulty of
sequencing stretches of identical nucleotides (homopolymeric
regions) longer than approximately five nucleotides,
due to the nonlinear light response they generate [10].
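
The decoding step described above (a known flow order plus one light intensity per flow) can be sketched in a few lines. This is an idealized illustration under stated assumptions: the flow order "TACG", the function names and the hard saturation limit are hypothetical, and the saturation is only a crude stand-in for the nonlinear light response, not a model of the real signal.

```python
def flowgram(new_strand, flow_order="TACG", max_flows=400):
    """Return idealized (base, incorporations) pairs, one per nucleotide flow.

    `new_strand` is the strand being synthesized; each flow extends it by
    as many consecutive matching bases as are present (a homopolymer run).
    """
    signals = []
    pos = 0
    flow = 0
    while pos < len(new_strand) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(new_strand) and new_strand[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))  # light is proportional to `run`
        flow += 1
    return signals


def call_bases(signals, saturation=None):
    """Translate per-flow intensities back into a base sequence.

    `saturation` (an assumption) caps the number of bases that can be
    distinguished in a single flow, mimicking a nonlinear response.
    """
    called = []
    for base, intensity in signals:
        n = round(intensity)
        if saturation is not None:
            n = min(n, saturation)
        called.append(base * n)
    return "".join(called)


signals = flowgram("TTACGGGGGGGA")
print(signals)
print(call_bases(signals))                # ideal detector
print(call_bases(signals, saturation=5))  # long G run is under-called
```

With an ideal detector the sequence is recovered exactly, whereas capping the response at five bases per flow shortens the run of seven G nucleotides, which is the kind of deletion error expected in homopolymeric regions.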

The comparable read lengths and increased sequence 
capacity compared to traditional Sanger sequencing 
made the massively parallel pyrosequencing ideal for 
de novo sequencing, re-sequencing of genomes and 
metagenomic studies [12]. James Watson, who con- 
tributed to solving the structure of DNA, had his genome 
sequenced by massively parallel pyrosequencing in 2008 
[13]. The major impact of this sequencing platform can be 
realized by comparing the sequencing of Watson's 
genome - completed in only two months at a cost of 
US$1 million [13] - to the HGP endeavor, which took 
11 years and cost US$3 billion [14]. Another milestone us- 
ing massively parallel pyrosequencing is the work carried 
out on the Neanderthal genome by Pääbo and co-workers
[15]. Recently, the read length of the platform has in- 
creased from 450 bases to rival that of routine Sanger 
sequencing at 700 bases. Massively parallel pyrose- 
quencing transcended Sanger sequencing by enormous- 
ly increasing throughput. However, the research commu- 
nity seemed to have developed an insatiable appetite for 
sequence data, which is not easily satisfied using bio- 
luminescent SBS. 

3.2 Reversible dye terminator SBS 

A fundamental gain in sequencing throughput was 
achieved when the second massively parallel sequencing 
system to be commercialized was launched in 2006. The 
technology is based on SBS using four colored reversible 
dye termination nucleotides and was first outlined in 2001 
by a small company, Solexa [16, 17], which was later ac- 
quired by Illumina [18]. 

SBS using reversible dye termination begins with 
fragmentation of the DNA and ligation of adaptors. In- 
stead of using emulsion PCR to spatially separate each 
DNA template, as in the massively parallel pyrosequenc- 
ing approach, another technique is employed. The DNA 
templates are hybridized via the adaptors to a planar sur- 
face, where each DNA template is clonally amplified by 
solid-phase PCR, also known as bridge amplification. 
This creates a surface with a high density of spatially dis- 
tinct clusters, each cluster of which contains a unique 
DNA template. These are primed and sequenced by 
passing the four spectrally distinct reversible dye terminators
in a flow of solution over the surface in the presence
of a DNA polymerase. Only single base extensions
are possible due to the 3' modification of the chain-ter- 
mination nucleotides, and each cluster incorporates only 
one type of nucleotide, as dictated by the DNA template 
forming the cluster. The incorporated base in all clusters 
is detected by fluorescence imaging of the surface before 
chemical removal of the dye and terminator, generating 
an extendable base that is ready for a new round of 
sequencing. The most common sequencing errors pro- 
duced in reversible dye termination SBS are substitu- 
tions [18, 19]. 

One of the major drawbacks of reversible dye termi- 
nation chemistry is the limitation in read length, as it is 
difficult to reach 100% efficiency of base incorporation 
and cleavage in each cycle. When the system was 
launched, it featured a high-quality read length of only
30-36 bases [18, 20], dramatically less than the read 
lengths offered by both Sanger sequencing and massive- 
ly parallel pyrosequencing. In fact, it was even lower than 
that of the "plus-minus" method developed in 1975 by 
Sanger and Coulson [21], but a dramatic increase in 
sequence data output compensated for this drawback. 
Substantial improvements in both read length and 
throughput have been achieved since its introduction, 
enabling reversible dye termination SBS to become the
most successful massively parallel sequencing platform.
The short read length initially prohibited its use in large 
de novo assembly applications, but made it suitable for 
re-sequencing and data counting applications, e.g. 
sequencing of transcriptomes, RNA-Seq and sequencing 
of DNA fragments involved in protein-DNA interactions, 
ChIP-Seq [18]. Improvements in read length and the development
of novel de novo assembly algorithms, such as
SOAPdenovo [22], have now enabled the use of reversible
dye termination SBS in de novo assembly of large mam- 
malian genomes, e.g. the giant panda genome [23]. 
Recently, Illumina released its rapid HiSeq 2500 system
capable of sequencing a human genome in 27 h, and 
enabling read lengths up to 150 bases through shorter 
sequence cycling times. 
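
The link between per-cycle efficiency and read length noted above can be made concrete with a simple calculation. The sketch below assumes a deliberately simplified model in which every molecule in a cluster independently completes a cycle with the same probability and any failure permanently removes it from the in-phase population; the efficiency values and the 50% threshold are illustrative assumptions, not figures from any platform.

```python
import math


def in_phase_fraction(efficiency, cycles):
    """Fraction of molecules in a cluster still in phase after `cycles`."""
    return efficiency ** cycles


def cycles_until(efficiency, threshold=0.5):
    """Largest cycle count for which the in-phase fraction stays >= threshold."""
    return math.floor(math.log(threshold) / math.log(efficiency))


for eff in (0.98, 0.99, 0.995):
    n = cycles_until(eff)
    print(eff, n, round(in_phase_fraction(eff, n), 3))
```

Under these assumptions, a per-cycle efficiency of 98% leaves half of the cluster in phase after 34 cycles, and 99% after 68 cycles, which illustrates why small gains in chemistry efficiency translate into substantially longer usable reads.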

3.3 Sequencing by chained ligation 

In nature, DNA ligases are essential enzymes for repairing 
nicks and breaks in DNA and are involved in high fidelity 
reactions in both DNA replication and repair [24]. In 2005, 
Church and Shendure demonstrated that the reaction 
they catalyze could be used to sequence short DNA frag- 
ments immobilized on beads trapped in a polyacrylamide 
gel in a process called polony sequencing [25]. SOLiD
(Sequencing by Oligonucleotide Ligation and Detection), 
the third commercial massively parallel sequencing plat- 
form, based on a more mature version of this technique, 
reached the market in 2007 [26]. 



Initially, the SOLiD platform yielded many more bases
than the Illumina platform, but the recent upgrades to the
HiSeq 2000 chemistry make this instrument's yield significantly
higher than that for SOLiD. In the library
preparation steps, the SOLiD platform resembles massively
parallel pyrosequencing. Clonal amplification is
achieved by emulsion PCR after creating an adaptor-lig- 
ated fragment library. The enriched beads are then ran- 
domly deposited on a glass slide and attached to its sur- 
face by a 3' modification introduced on the targets cover- 
ing the beads. In subsequent sequencing by chained lig- 
ation, oligomers of eight nucleotides, octamers, are used 
as detection probes instead of single nucleotides. Each 
octamer contains two known bases at the 3' end followed 
by three degenerate nucleotides and three nucleotides 
that can hybridize to any other nucleotide, called univer- 
sal nucleotides. Four spectrally distinct dyes are 
employed, each of which is carried by four octamers, cre- 
ating a total probe set of 16 octamers, covering all possi- 
ble combinations of the two known bases. The dye is 
attached to the 5' end of each octamer. To initiate the first 
sequencing cycle, the DNA-covered beads are incubated 
with a sequencing primer, the 16 octamer probes and lig- 
ase. Only a perfectly hybridized probe will be joined to the 
sequencing primer by the ligase. The probes are imaged 
to decode the first two bases, and then chemically 
cleaved to remove the last three 5' bases. This cycle of 
probe hybridization and ligation, fluorescent imaging and 
chemical cleavage is repeated ten times, interrogating 
every fifth base. The created fragment is then stripped 
from the beads and a second ligation round is performed 
using a sequencing primer annealing one base upstream 
of the previous one. Hence, each base is associated with 
two color calls. Progressively shifting the position of the 
sequencing primer and repeating the ligation round using 
ten cycles each time yields color calls, which can be de- 
coded to obtain a linear sequence in color space. Trans- 
lating from color space to an ordinary nucleotide 
sequence is most accurately performed by aligning the 
color space reads to a color space reference genome. As a 
consequence of the color space alignment, errors can be 
efficiently corrected since only certain color combinations 
are allowed, facilitating distinction between nucleotide 
substitutions, represented by a two-color change in adja- 
cent nucleotides, and sequence errors [26-28]. The most 
common sequencing errors in sequencing by chained lig- 
ation are substitutions [18]. 
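
The way color space separates true substitutions from measurement errors can be shown with a toy encoder. The sketch below assumes the commonly described two-base scheme in which each color is the XOR of two-bit codes for adjacent bases (A=0, C=1, G=2, T=3); this mapping is not given in the text above, and the function names are hypothetical. It is an illustration of the principle, not of the vendor's software.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # assumed 2-bit codes


def encode(seq):
    """Return (first base, list of colors); each color spans two bases."""
    return seq[0], [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Translate colors back to bases, starting from a known first base."""
    out = [first_base]
    for color in colors:
        out.append(BASES[CODE[out[-1]] ^ color])
    return "".join(out)


def classify_mismatches(ref_colors, read_colors):
    """Label color differences between a reference and a read.

    Two adjacent changes whose combined XOR is unchanged are consistent
    with a single-base substitution; other changes are treated as errors.
    """
    diffs = [i for i, (r, q) in enumerate(zip(ref_colors, read_colors)) if r != q]
    calls = []
    i = 0
    while i < len(diffs):
        pos = diffs[i]
        adjacent = i + 1 < len(diffs) and diffs[i + 1] == pos + 1
        if adjacent and (ref_colors[pos] ^ ref_colors[pos + 1]) == (
            read_colors[pos] ^ read_colors[pos + 1]
        ):
            calls.append((pos, "substitution"))
            i += 2
        else:
            calls.append((pos, "sequencing error"))
            i += 1
    return calls


first, ref = encode("ATGGCA")
_, snp = encode("ATGTCA")            # true substitution G -> T
print(ref, snp)
print(classify_mismatches(ref, snp))
noisy = [3, 1, 0, 0, 1]              # one miscalled color
print(classify_mismatches(ref, noisy))
print(decode(first, noisy))          # naive decoding corrupts all later bases
```

The last line shows why translation is best done after alignment in color space: decoding a read with a single miscalled color directly into bases alters every base downstream of the error.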

Life Technologies has recently announced a replace- 
ment of the cumbersome emulsion library preparation step 
with a technique called wildfire, which resembles the 
bridge amplification performed in the Illumina platform. 
This promises a reduction of the library preparation time 
as well as a higher density on the chip surface, enabling a 
higher throughput. The SOLiD system is best suited for
re-sequencing projects demanding low error rates [26],
transcriptome sequencing and tag counting, e.g. ChIP-Seq,
due to its short read length, error correction system and
massive output of data [18]. The color space approach 
efficiently detects sequence errors, but the downstream 
data analysis has proven to be cumbersome, leading to the 
development of few pertinent open source tools [29]. How- 
ever, it is possible to exploit the ligation reaction in 
sequencing without applying the color space approach. 

3.4 Sequencing by unchained ligation 

In 2010, Complete Genomics Inc published a proof-of- 
principle paper demonstrating its complete human 
genome sequencing capabilities, using a sequencing 
method sharing several features with polony sequencing 
[25, 30]. Rather than selling sequencing instruments, 
which has been the dominant business model so far, the 
company provides an outsourcing service for human DNA 
sequencing including library preparation, sequencing 
and basic analysis. 

Complete Genomics performs sequencing by un- 
chained ligation using its combinatorial probe-anchor lig- 
ation (cPAL) technology. The DNA to be sequenced is 
fragmented to 500 bases and four adaptors are inserted 
into each genomic fragment by repeated cutting with re- 
striction enzymes and intramolecular ligation. The result- 
ing circles are then amplified in solution into coils of sin- 
gle-stranded DNA [31], called DNA nanoballs, composed 
of repeated copies of the original circles. The DNA 
nanoballs are then randomly attached to ordered spots on 
a planar surface. Each spot is small enough to contain only 
one DNA nanoball. Complete Genomics' approach of 
sequencing by unchained ligation uses anchor probes to 
find adaptor sites and detection probes consisting of nine 
bases (nonamers) and four dyes. The anchor probe and 
detection probe are joined by hybridization and ligation to 
decode one base each cycle. The nanoballs can be de- 
coded, one nucleotide at a time, on both sides of each 
adaptor site. Briefly, an anchor oligonucleotide is hy- 
bridized to one of the four adaptors, and four probes, each 
carrying a spectrally distinct dye, are used to interrogate 
a base position adjacent to the adapter. The dye is con- 
nected to a known base at a specific position in the non- 
amer probe, while the rest of the positions are filled by de- 
generate nucleotides. Varying the position of the known 
base in the probe allows positions 1-5 in the DNA tem- 
plate adjacent to the adaptor to be read. Creation of an ex- 
tended anchor probe by ligation of two anchor probes al- 
lows decoding of positions 6-10 adjacent to the adaptor. 
The anchor-probe complex is stripped off after fluorescent 
detection of each hybridization and ligation reaction. This 
resets the system. Hence, each base call is unchained, 
thereby improving sequence quality because every 
decoded base is independent of the completeness of 
earlier cycles. The whole procedure is then repeated for 
both sides of each adaptor site, and the resulting 10-mers 
are merged into 35-base paired-end reads [30].



The reasons that Complete Genomics are keeping 
their sequencing platform in-house are, at least partly, re- 
lated to the considerable challenge of mapping and as- 
sembling the highly repetitive human genome using 
short, 35-base paired-end reads and the rather demanding
cPAL library preparation. Illumina has responded to
Complete Genomics' initiative by also offering complete 
human genome sequencing as a service, using reversible 
dye termination chemistry. It remains to be seen if most 
sequencing will be performed in-house at hospitals and 
sequencing centers, or outsourced to sequencing compa- 
nies. 

3.5 Ion-sensitive SBS 

All massively parallel sequencing platforms described so 
far use optic detection to decipher a DNA sequence. 
Exchanging optic detection for electric detection would 
potentially have several advantages since it would avoid 
the need for modified nucleotides, expensive optics and 
the time-consuming acquisition and processing of large 
amounts of fluorescent image data. Hence, it promises 
cheaper, space-efficient and more rapid sequencing. 
Pourmand and colleagues [32] outlined an electric detec- 
tion SBS approach in 2006, demonstrating that DNA syn- 
thesis events could be detected by electric charge per- 
turbations using a voltage-clamp amplifier. In 2010, Roth- 
berg and colleagues [33] demonstrated a novel sequenc- 
ing platform, based on the work by Pourmand et al. [34], 
the Ion Torrent, which relies on electric rather than fluo- 
rescent signals to obtain sequence data. 

The process is very similar to that of the massively par- 
allel pyrosequencing, developed earlier by Rothberg and 
co-workers. It involves use of the same library preparation 
steps, with an adapter-ligated fragment library, which is 
clonally amplified on beads using emulsion PCR. A DNA 
polymerase is used to incorporate nucleotides on a 
primed DNA template with a predetermined flow of nu- 
cleotides. However, the focus is shifted from the PPi to the 
other by-product of nucleotide incorporation, the hydro- 
gen ion. The enriched beads from the emulsion PCR are 
deposited in small wells (each equipped with a sensor ca- 
pable of detecting free protons) on an ion chip. As the 
DNA polymerase adds nucleotides to the DNA template,
hydrogen ions are released, and the resulting free proton 
shift is detected by the sensor and transformed to an elec- 
tric signal. The ion chip is based on the most widely used 
technology to construct integrated circuits, facilitating 
low-cost, large-scale production and should confer excel- 
lent scalability. However, as in massively parallel pyrose- 
quencing, homopolymeric regions give rise to insertions 
and deletions, indels, due to the nonlinear electric re- 
sponse upon incorporation of more than five nucleotides 
[33]. 

Sequencing of Gordon Moore's genome [33] demonstrated
that it is possible, but perhaps not recommendable,
to use the Ion Torrent system for large mammalian
re-sequencing. The Ion Torrent system has a fast sequencing
run time of just 2 h, but in throughput and read length
it resembles massively parallel pyrosequencing at the time
of its launch. It seems most suited for de
novo sequencing and re-sequencing of bacterial genomes 
or sequencing targeted regions of more complex organ- 
isms' genomes. The recent outbreak of the Shiga toxin- 
producing substrain of Escherichia coli provided an ex- 
cellent example of the value of a sequence platform with 
a modest output but fast sequencing run time when mon- 
itoring outbreaks of pathogens [35, 36]. The successor to 
Ion Torrent is called the Ion Proton, scheduled to be re- 
leased in 2012, and is based on the same chemistry as its 
predecessor but is expected to have about 60-fold more 
wells on its Ion Proton II chip. This massive increase in 
throughput should enable the sequencing of a human 
genome within a few hours of sequencing. 

3.6 Current issues in massively parallel sequencing 

The massively parallel sequence methods are quite di- 
verse in terms of sequencing biochemistry, but they share 
many common features. Their library preparation steps 
begin with random fragmentation of DNA followed by lig- 
ation of platform-specific adaptors at the end of each frag- 
ment. These adaptors are then used in amplification of the 
fragment on a solid surface by a polymerase, except in 
cPAL amplification, which is performed in solution. The 
amplification products are spatially clustered on an array 
before sequencing. The sequencing process itself is per- 
formed by an orchestrated automated series of enzyme- 
driven biochemical and fluorescent imaging data acquisi- 
tion steps. Only the newest system, the Ion Torrent, based 
on free proton shifts, is capable of electric detection. All of 
these platforms also have the capability to read both ends 
of a DNA fragment, called paired-end sequencing. This 
feature is instrumental in resolving repetitive regions in 
genomes and quantifying transcript isoforms in RNA-Seq. 

The rate-limiting step in the sequencing process has 
traditionally been the sequencing reaction. However, that 
started to change towards the end of the HGP, when the 
capacity of sequencing instruments began to exceed the 
rate at which new samples could be prepared for 
sequencing. Current sequencing instruments produce 
several orders of magnitude more data than conventional 
Sanger sequencing, shifting the rate-limiting steps to 
library preparation and data analysis. The challenge is to 
keep the processes of sample preparation, sequencing 
reaction and data analysis balanced. Hence, automation 
of library preparation and quality control steps has played 
a vital role in keeping up to speed with the increase in 
sequencing power. The generally short read lengths, coupled
with the enormous amount of data to be analyzed and
reduced raw accuracy compared to Sanger sequencing,
have introduced many challenges in the downstream data
analysis [18]. However, the development of new algorithms
tailored to the new kinds of data generated and the
use of supercomputer clusters for the data analysis have 
alleviated some of these challenges. 

4 Massively parallel sequencing - 
single-molecule sequencing 

Massively parallel consensus sequencing has become the 
dominant sequencing technology, but other approaches 
have emerged that avoid amplification of the DNA tem- 
plate prior to sequencing. The aim of these technologies 
is to sequence single DNA molecules, preferably in real 
time. Potential benefits of using single-molecule se- 
quencing are: the minimal quantities of input DNA re- 
quired; elimination of amplification bias; asynchronous 
synthesis; fast turnaround times; and the capacity to in- 
vestigate the characteristics of individual DNA mole- 
cules. A comparison of consensus and single-molecule se- 
quencing, as well as the most common errors for each se- 
quencing technology is shown in Fig. 2. 

4.1 Reversible single-dye terminator SBS 

Helicos Biosciences introduced the first single-molecule 
DNA sequencer, the Heliscope sequencer system in 2008 
[37]. The system uses labeled reversible chain-terminat- 
ing nucleotides in an SBS approach with immobilized 
DNA on a planar surface [38, 39]. Hence, it shares many 
features with massively parallel consensus sequencers in 
set-up, biochemistry, sequence run time and throughput. 
It was used in the first single-molecule sequencing of a 
human genome in 2009 [40] and can be used for amplifi- 
cation-free direct RNA sequencing [41]. However, the 
platform has been on the market since 2008 and seems to 
be struggling in the fierce competition. The reasons for 
this probably lie in its high raw read error rates (>5%)
and short read length (~32 bases) [42].

4.2 Zero-mode waveguide sequencing 
with immobilized polymerase 

Pacific Biosciences presented the first single-molecule, 
real-time SBS sequencing platform in a proof-of-principle 
paper in Science in 2009. The platform introduced sever- 
al interesting novel features, including immobilization of 
the DNA polymerase instead of the DNA, real-time moni- 
toring of the synthesis process and potential read lengths 
of several kilobases of DNA [43]. 

Figure 2. (A) Comparison of number of DNA molecules required for generating a base call in consensus sequencing and single-molecule sequencing.
(B) The most common type of sequencing errors per sequencing technology.

The single-molecule real-time (SMRT) sequencing relies
on a processive DNA polymerase immobilized via a
biotin-streptavidin linkage at the bottom of a nanostructure
called a zero-mode waveguide (ZMW) [44]. This
structure is critical because it provides a confined optical
observation space of ~100 zeptoliters (100 x 10^-21 L), enabling
minimization of background noise, parallelization,
and monitoring of single-molecule DNA polymerization. 
The DNA template is allowed to diffuse into the ZMW in
the presence of primer and nucleotides with fluorescent 
labels attached to the phosphate chain. The DNA poly- 
merase latches onto the DNA template and begins incor- 
porating the distinctly labeled nucleotides. The DNA se- 
quence is determined by recording each nucleotide in- 
corporation event dictated by the DNA template. Nu- 
cleotides that are being incorporated remain for a longer 
time in the optical detection volume than freely diffusing 
nucleotides, facilitating the separation of background 
noise generated by a high concentration of labeled nu- 
cleotides from true nucleotide incorporation events. The 
label attached to the phosphate chain is cleaved off dur- 
ing the elongation process. The polymerases are random- 
ly immobilized among the ZMWs, leading to only a third 
of them being occupied by a single polymerase. The rest 
of the ZMWs are either empty or contain two or more poly- 
merases, which dramatically reduces the potential ca- 
pacity of the ZMW array. SMRT sequencing has high er- 
ror rates, mainly due to failure to detect all incorporations 
leading to indels [42, 43]. 
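
The figure of roughly one third singly occupied ZMWs is what one would expect if random immobilization is modelled as a Poisson process. The text states only that loading is random; the Poisson model, the function names and the loading values below are assumptions made for illustration. Under that model the fraction of ZMWs holding exactly one polymerase peaks at 1/e (about 37%) when the mean loading is one polymerase per ZMW, and falls off on either side.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k polymerases in a ZMW at mean loading lam."""
    return lam ** k * exp(-lam) / factorial(k)


def occupancy(lam):
    """Return (empty, single, multiple) ZMW fractions under Poisson loading.

    The Poisson model is an assumption; the review only says that
    polymerases are immobilized at random.
    """
    empty = poisson_pmf(0, lam)
    single = poisson_pmf(1, lam)
    multiple = 1.0 - empty - single
    return empty, single, multiple


if __name__ == "__main__":
    # Hypothetical mean loadings (polymerases per ZMW).
    for lam in (0.5, 1.0, 2.0):
        empty, single, multiple = occupancy(lam)
        print(f"mean={lam:.1f}  empty={empty:.3f}  "
              f"single={single:.3f}  multiple={multiple:.3f}")
```

In this sketch, raising the polymerase concentration above a mean of one per ZMW does not help: fewer wells are empty, but more contain two or more polymerases and are equally unusable.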

SMRT sequencing has several interesting features 
that can be exploited. A circular template can be used, en- 
abling a high-quality consensus sequence to be obtained 
by allowing the polymerase to make multiple passes along 
the same DNA template [45]. The physical read length 
can be extended by turning off the laser used to illuminate 
the ZMWs at pre-determined intervals, generating sub- 
reads from the same genomic fragment in a process called 
strobe sequencing [46]. A truly novel feature is the ability
to capture kinetic information by observing the activity of
an immobilized enzyme in real time. This enables the de- 
tection of chemical modifications to bases, such as 
methylation [47], and the observation of translation kinetics
by employing labeled tRNAs and immobilizing a ribosome
instead of a polymerase in the ZMW [48]. The SMRT
sequencing approach has great potential, with unique features
and fast sequencing runs, but suffers from raw error rates
exceeding 10%, largely due to indels, and modest output
(2 x 75 000 ZMWs per SMRT cell, yielding only about a third
as many reads), which hamper its use in many applications.
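
The benefit of multiple passes over a circular template can be illustrated with a deliberately simplified calculation. The sketch below assumes that each pass observes a given base independently, that a pass is either right or wrong, and that the consensus is a plain majority vote. None of this is taken from the cited work: real SMRT errors are dominated by indels, so an actual consensus requires alignment of the passes, and the 15% per-pass error rate is a hypothetical value chosen only because it is consistent with "exceeding 10%".

```python
from math import comb


def majority_error(p, n_passes):
    """Probability that a majority vote over n_passes is wrong.

    p is the per-pass error probability for one base. Passes are assumed
    independent (a simplification). Ties, which can occur for an even
    number of passes, are counted as errors.
    """
    votes_needed = n_passes // 2 + 1  # strict majority of correct calls
    p_correct = sum(
        comb(n_passes, k) * (1 - p) ** k * p ** (n_passes - k)
        for k in range(votes_needed, n_passes + 1)
    )
    return 1.0 - p_correct


if __name__ == "__main__":
    per_pass_error = 0.15  # hypothetical raw error rate
    for n in (1, 3, 5, 9):
        print(f"{n} passes: consensus error ~ "
              f"{majority_error(per_pass_error, n):.4f}")
```

Under these assumptions the consensus error falls steeply with the number of passes, which is the intuition behind trading read length for accuracy on short circular templates.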

4.3 Nanopore sequencing 

In 1996, Deamer and colleagues demonstrated that single- 
stranded DNA (1.1-1.3 nm; [49]) or RNA molecules could 
be driven through an ion channel in a lipid bilayer by an 
electrical field [50]. That paper outlined the basic
principles of a nanopore sequencing method. The DNA to 
be sequenced would be decoded by monitoring an elec- 
tric current as the DNA passed through a nanopore (a 
channel 1-10 nm in diameter) in a membrane, which 
could be either nanofabricated or created by an engi- 
neered protein. Nanopore sequencing can potentially of- 
fer minimal sample preparation, electrical or fluorescent 
readout and read lengths of several kilobases from single 
molecules of DNA in real time [42, 51]. 

Several problems must be resolved before nanopore 
sequencing can be fully exploited. The nanopores are too 
long to provide single-base resolution, as several nu- 
cleotides in the pore contribute to each change in electric 
current. In addition, the high speed of DNA translocation
through the nanopores complicates the electrical meas- 
urements, making the nucleotide identification signals 
difficult to interpret [42]. However, solutions to these 
problems have been proposed. IBM is developing a nano- 
pore matrix, which resembles a transistor with alternating 
layers of metal and dielectric material. Simulations indi- 
cate that modulating the current in this "transistor" can 
control the speed of DNA translocation [52]. However, the
effect needs to be verified experimentally and a potential 
device still needs to achieve single-base resolution [42, 
52]. Genia is a nanopore sequencing company using a 
lipid bilayer with a single nanopore connected to a sensor 
on an integrated circuit. They aim to reach a chip con- 
taining ~1 million such sensors for the commercial launch 
of their system, which is scheduled for 2013. However,
no tests of the Genia system by non-vendors have been
reported yet. Another commercial player trying to
achieve nanopore sequencing is Oxford Nanopore Tech- 
nologies. They currently have two DNA sequencing 
strategies: strand and exonuclease sequencing. Exonuclease
sequencing employs a modified α-hemolysin with an attached
exonuclease, situated within a synthetic membrane with high
electrical resistance. The exonuclease cleaves off single
nucleotides and feeds them into the pore. Each nucleotide
can then be identified from its distinct electrical signal as
it transiently binds to a cyclodextrin molecule when passing
through the pore [53]. Strand sequencing relies on a polymerase
(or other DNA-modifying enzyme) feeding single-stranded DNA
into the pore and detecting the electrical signal of a small
group of bases (for example, three) interacting with a specific
region of the protein pore.
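
When several bases contribute to each current level, the signal has to be interpreted in terms of overlapping groups of bases rather than single nucleotides. The sketch below illustrates this for a window of three bases, the example size mentioned above. The function names and the test sequence are hypothetical, and no real current values or vendor base-calling method are implied.

```python
from itertools import product


def kmers_in_pore(sequence, k=3):
    """List the successive k-base windows sensed as a strand moves
    through the pore one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def distinct_levels(k=3):
    """Number of different k-base combinations a pore would need to
    distinguish if every combination gave its own current level."""
    return len(list(product("ACGT", repeat=k)))


if __name__ == "__main__":
    strand = "GATTACA"  # hypothetical example sequence
    print(kmers_in_pore(strand))
    print(distinct_levels(3))
```

Two consequences follow from this picture: the number of signal states grows as 4^k, and each base influences k consecutive measurements, so neighbouring measurements are not independent.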

Recently, Oxford Nanopore Technologies announced 
its ability to sequence the entire 48-kilobase Lambda 
genome on both sense and antisense strands using their 
strand sequencing method. Commercialization of the 
nanopore-based sensing chemistry on an electronic- 
based platform, the GridION, and a disposable portable
device for electronic single-molecule sensing, the MinION,
is scheduled for 2012. Oxford Nanopore Technologies claims
competitive accuracy and ultra-long read-length
single-molecule data, which it says will deliver a complete
human genome in 15 min using multiple GridIONs. It is
too early to tell whether Oxford Nanopore Technologies
can fully deliver on these promises, and the acid test for
the sequencing technique will be the assessment by
early testers. However, if the promise of competitive accuracy
and ultra-long reads holds true, it will cause another
paradigm shift in the sequencing field.
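
To put the 15-min claim in perspective, a back-of-the-envelope estimate of the aggregate throughput it implies can be made. The calculation assumes a haploid human genome of about 3.2 gigabases; the coverage values are hypothetical, since the announcement as described here does not state a coverage or the number of instruments.

```python
GENOME_BASES = 3.2e9      # approximate haploid human genome size
RUN_SECONDS = 15 * 60     # the claimed 15-min run time


def required_throughput(coverage):
    """Aggregate bases per second needed across all instruments to
    reach the given average coverage within the run time."""
    return GENOME_BASES * coverage / RUN_SECONDS


if __name__ == "__main__":
    # Hypothetical coverage levels, for illustration only.
    for coverage in (1, 10, 30):
        rate = required_throughput(coverage)
        print(f"{coverage:>2}x coverage: {rate / 1e6:.1f} million bases/s")
```

Even at single-fold coverage, the combined instruments would need to read several million bases per second, which shows how strongly the claim depends on parallelization.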

5 Future perspectives 

Sequencing technologies have advanced at a rapid pace 
in recent years and the advance shows no signs of slowing
down. This has created an imbalance between the
sequencing per se and other procedures involved, both 
upstream and downstream, i.e. sample preparation and 
sequence data analysis and storage. Bar coding of sam- 
ples coupled to automation of steps prior to sequencing 
has the potential to handle most upstream process issues. 
The true challenges lie in the downstream steps of 
sequence data analysis, handling and storage. The 
assembly and mapping algorithms need to become more 
efficient and accurate to handle the massive amounts of 
sequence data and extract as much pertinent information 
from them as possible. Long-term storage of sequence 
data and subsequent analysis are cumbersome and expensive;
thus, in the future, it might be cheaper and easier to
simply re-sequence samples as the information is needed. 
There are, of course, several challenges remaining in the 
actual sequencing reactions. Two of the most urgent are 
increasing the read lengths and the quality of the 
sequence reads, enabling resolution of common repeats 
and duplicated regions. Currently, the massively parallel 
sequencing and single-molecule platforms are comple- 
mentary, and no single technology is best suited for all ap- 
plications. Thus, it remains to be seen which sequencing 
technology will cause the next sequencing paradigm 
shift. 

Sequencing has traditionally been reserved for large 
and very well-funded research centers, but the cost per 
base sequenced started to decrease toward the end of the
HGP. The pace of cost reduction accelerated even further 
with the launch and subsequent evolution of the mas- 
sively parallel sequencing instruments seen in recent 
years. What was once restricted to a privileged few can 
now be accessed by many, leading to a revolution in the 
availability of sequencing power. 

The democratization of sequencing is evident from 
the constantly increasing number of applications and 
analyses that involve sequence data. Sequencing is today 
applied in diverse areas, such as forensics, agriculture, 
biofuel production, paleontology, domestication, and 
medicine, to name but a few. This ongoing revolution
promises to bring sequencing to nearly every aspect of 
life. 